ML Course Part 2 - Introduction to Deep Learning
1 Introduction
2 Definitions (Reminders of Previous Session)
2.1 Categories of ML
2.1.1 Type of dataset
- Supervised: for each input in the dataset, the expected output is also part of the dataset
- Unsupervised: for each input in the dataset, the expected output is not part of the dataset
- Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
- Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions
2.1.2 Type of output
- Classification: assigning one (or multiple) label(s) chosen from a given list of classes to each element of the input
- Regression: assigning one (or multiple) value(s) chosen from a continuous set of values
- Clustering: create categories by grouping together similar inputs
2.2 Dataset
2.2.1 Dataset
Dataset
A collection of data used to train, validate and test ML models.
2.2.2 Content
Instance (or sample)
An instance is one individual entry of the dataset.
Feature (or attribute or variable)
A feature is a type of information stored in the dataset about each instance.
Label (or target or output or class)
A label is a piece of information that the model must learn to predict.
2.2.3 Subsets
Dataset subsets
A ML dataset is usually subdivided into three disjoint subsets, each with a distinctive role in the training process:
- Training set: used during training to train the model,
- Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
- Test set: used after training to evaluate the performance of the model on new data it has not encountered before.
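As an illustration, here is a minimal sketch of such a split using PyTorch (the toy dataset and the 70/15/15 proportions are arbitrary choices for the example):

import torch
from torch.utils.data import TensorDataset, random_split

# A toy dataset: 1000 instances with 4 features and 1 label each
X = torch.randn(1000, 4)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# Split into three disjoint subsets: 70% train, 15% validation, 15% test
train_set, val_set, test_set = random_split(dataset, [700, 150, 150])
print(len(train_set), len(val_set), len(test_set))  # 700 150 150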
3 Structure of NN
3.1 Definitions
3.1.1 Neural Network
Neural Network (NN)
Subtype of ML model inspired by the brain. Composed of several interconnected layers of nodes capable of processing and passing information.
3.1.2 Deep Learning
Deep Learning (DL)
Subcategory of Machine Learning. Consists in using large NN models (i.e. with a high number of layers) to solve complex problems.
3.2 Structure of a NN model
3.2.1 Basic Elements
Neuron
Takes multiple inputs, computes a weighted sum of them (usually plus a bias) and passes the result as output.
Layer
Set of similar neurons taking different inputs and/or having different weights.
Neural Network (NN)
Sequence of layers.
3.2.2 Linear Functions
Linear Function
Function that can be written like this: \[ f(\alpha_1, \cdots, \alpha_n) = (\beta_{1,1} \alpha_1 + \cdots + \beta_{1,n} \alpha_n, \cdots, \beta_{m,1} \alpha_1 + \cdots + \beta_{m,n} \alpha_n) \]
Composition of Linear Functions
The composition of any number of linear functions is a linear function.
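This can be checked numerically; a quick PyTorch sketch with bias-free layers (to match the definition above):

import torch
import torch.nn as nn

# Two stacked linear layers without an activation in between...
f = nn.Linear(3, 5, bias=False)
g = nn.Linear(5, 2, bias=False)

# ...are equivalent to a single linear layer whose matrix is the product
h = nn.Linear(3, 2, bias=False)
with torch.no_grad():
    h.weight.copy_(g.weight @ f.weight)

x = torch.randn(4, 3)
print(torch.allclose(g(f(x)), h(x), atol=1e-6))  # True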
3.2.3 Activation Functions - Definition
Activation Function
Function applied to the output of a NN layer (i.e. to the output of each of its neurons) to introduce non-linearity to the model.
Activation functions allow the network to approximate much more complex functions, using a sequence of interleaved affine layers and activation layers.
3.2.4 Activation Functions - Examples
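Some of the most common activation functions are:
- ReLU: \( f(x) = \max(0, x) \)
- Sigmoid: \( f(x) = \frac{1}{1 + e^{-x}} \), squashing values into \( (0, 1) \)
- Tanh: \( f(x) = \tanh(x) \), squashing values into \( (-1, 1) \)
- Softmax: normalizes a vector of scores into a probability distribution (common for classification outputs)
A quick way to get a feel for them in PyTorch:

import torch

x = torch.linspace(-3, 3, 7)  # a few sample values around 0
print(torch.relu(x))     # max(0, x): negative values become 0
print(torch.sigmoid(x))  # 1 / (1 + exp(-x)): squashed into (0, 1)
print(torch.tanh(x))     # squashed into (-1, 1)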
3.3 Affine layers
3.3.1 Fully Connected
The most basic layer, in which each output is a linear combination of all the inputs (before the activation layer).
3.3.2 Convolutional
A layer combining geographically close features, widely used to process rasters (grid-like data such as images).
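For example, in PyTorch (the dimensions are arbitrary):

import torch
import torch.nn as nn

# 3 input channels (e.g. RGB), 8 output channels, 3x3 kernel
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
images = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
print(conv(images).shape)           # torch.Size([1, 8, 32, 32])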
3.3.3 Recurrent
Type of layer designed to process sequential data such as text, time series, speech or audio. It works by combining the input data with the state of the previous time step.
The two main variants of recurrent layers are:
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
Nowadays, however, transformer architectures are preferred for processing sequential data.
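Still, a short PyTorch example of an LSTM layer makes the state-passing idea concrete (dimensions are arbitrary):

import torch
import torch.nn as nn

# Each time step combines the current input with the previous hidden state
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
sequence = torch.randn(4, 15, 10)  # (batch, time steps, features)
output, (hidden, cell) = lstm(sequence)
print(output.shape)  # torch.Size([4, 15, 20]): one output per time step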
3.3.4 Pooling
A type of layer used to reduce the number of features by merging multiple features into one. There are multiple kinds of pooling layers, the simplest being Maximum Pooling and Average Pooling.
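A quick PyTorch illustration (arbitrary dimensions):

import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)  # (batch, channels, height, width)
# Each 2x2 patch of features is merged into a single feature
print(nn.MaxPool2d(kernel_size=2)(x).shape)  # torch.Size([1, 8, 16, 16])
print(nn.AvgPool2d(kernel_size=2)(x).shape)  # torch.Size([1, 8, 16, 16])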
3.3.5 Residual
A Residual Block aims at stabilizing training and convergence of deep neural networks (with a large number of layers), by adding the input of a given layer to the output of another layer further down in the architecture.
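A minimal sketch of the idea, here with fully connected layers:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        # The input is added back to the output ("skip connection")
        return x + self.fc2(torch.relu(self.fc1(x)))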
3.3.6 Attention
Attention aims at determining the relative importance of each part of the input in order to make better predictions. It is widely used in natural language processing (NLP) and image processing.
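A short PyTorch sketch of a self-attention layer (dimensions are arbitrary):

import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
tokens = torch.randn(2, 10, 16)  # (batch, sequence length, embedding dimension)
# Self-attention: each element of the sequence attends to every other element
output, weights = attention(tokens, tokens, tokens)
print(output.shape)   # torch.Size([2, 10, 16])
print(weights.shape)  # torch.Size([2, 10, 10]): relative importance of each pair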
3.3.7 And a lot more…
- Dropout: randomly drop out some of the nodes during training to reduce overfitting
- Batch Normalization: normalize the input of each layer across the batch to improve training stability and speed
- Layer Normalization: normalize the input of each layer across the features to improve training stability and speed
- Embedding: transforms discrete input data into continuous vectors in a lower-dimensional space
- Flatten: convert multi-dimensional data into 1D data that can be fed into fully connected layers
- …
3.4 Architectures of NN
3.4.1 Fully Connected Network (FCN)
The simplest kind of NN model, and quite general.
In more complex models, FC layers are often used at the end of the model to put together extracted features and get the final predictions.
3.4.2 Convolutional Neural Network (CNN)
Standard for image processing, with 2D convolutional layers followed by fully connected layers.
Idea: incrementally extract bigger and bigger patterns by processing nearby cells together.
Can also be used for videos, audio or text.
Requires lots of data to train.
3.4.3 A Lot More
- Recurrent Neural Network (RNN)
- Generative Adversarial Network (GAN)
- Autoencoder Network
- Transformer Network
3.5 Examples
3.5.1 Model
The examples below implement the same small model: a fully connected network with 2 inputs, two hidden layers of 20 and 10 neurons (each followed by a ReLU activation), and 1 output.
3.5.2 PyTorch
import torch
import torch.nn as nn

# Simple fully connected model with 2 hidden layers
class SimpleMLP(nn.Module):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(2, 20)   # Input layer to 1st hidden layer
        self.fc2 = nn.Linear(20, 10)  # 1st hidden layer to 2nd hidden layer
        self.fc3 = nn.Linear(10, 1)   # 2nd hidden layer to output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # ReLU activation after first layer
        x = torch.relu(self.fc2(x))  # ReLU activation after second layer
        x = self.fc3(x)              # Output layer (no activation for regression tasks)
        return x
3.5.3 TensorFlow + Keras
import tensorflow as tf
from tensorflow.keras import layers, Model

# Simple fully connected model with 2 hidden layers
class SimpleMLP(Model):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.fc1 = layers.Dense(20, activation='relu')  # Input layer to 1st hidden layer
        self.fc2 = layers.Dense(10, activation='relu')  # 1st hidden layer to 2nd hidden layer
        self.fc3 = layers.Dense(1)  # 2nd hidden layer to output layer (no activation for regression)

    def call(self, inputs):
        x = self.fc1(inputs)
        x = self.fc2(x)
        return self.fc3(x)
4 Optimization
4.0.1 Why Optimization?
The number of possible combinations of parameters is huge, even with small NN models. To (hopefully) find the best possible combination, we need two things:
- A way to evaluate any combination
- A way to find well-performing combinations
4.1 Loss Function
4.1.1 Definition
A loss function is a mathematical function that quantifies the difference between the network’s predicted output and the actual target values. The goal during training is to minimize this loss by adjusting the model’s weights, using gradient descent.
The most common loss functions are:
- Mean Squared Error (MSE) for regression
- Cross-entropy Loss for classification
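Both are available in all the major libraries; for instance, in PyTorch:

import torch
import torch.nn as nn

# MSE for regression: mean of the squared differences
mse = nn.MSELoss()
print(mse(torch.tensor([2.5, 0.0]), torch.tensor([3.0, -1.0])))  # 0.625

# Cross-entropy for classification: takes raw scores (logits) and class indices
ce = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, 0.1]])  # scores for 3 classes
target = torch.tensor([0])                # index of the true class
print(ce(logits, target))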
4.1.2 Differentiable
To be able to perform gradient descent, the loss function must be differentiable, which means continuous (no jumps) and smooth (no sudden changes of direction).
4.1.3 Convex
To get the best results when performing gradient descent, it is also better if the function is convex. The simplest definition of convexity is that if you trace a straight line between two points on the curve, the curve will be below the segment between the two points.
4.2 Gradient Descent
4.2.1 Definition
Gradient Descent
The process of iteratively computing the gradient of a function and stepping in the opposite direction to (hopefully) reach the minimum value of the function (if it exists).
Gradient Descent works because, at any point in the definition space of the function, the gradient points in the direction of steepest ascent. So locally, stepping in the opposite direction is the quickest way to reach lower values of the function. If we come back to the requirements listed before:
- Differentiable functions are functions where the gradient exists everywhere
- Convex functions are convenient for gradient descent because they have a single minimum, so steadily going down the function will always lead to the global minimum.
4.2.2 Algorithm
Gradient Descent boils down to iteratively:
- Compute the gradient of the loss function at the current point
- Take a step in the opposite direction of the gradient to reach a new point
- Repeat from step 1 until the stop condition is met
In this process, the three things that have to be defined are:
- The starting point (weights initialization)
- The size of the steps (learning rate)
- The condition to stop
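As an illustration, here is a minimal sketch of these steps in PyTorch, minimizing the toy function \( f(w) = (w - 3)^2 \), whose minimum is at \( w = 3 \):

import torch

w = torch.tensor(0.0, requires_grad=True)  # starting point (weight initialization)
learning_rate = 0.1                        # size of the steps

for step in range(50):                     # stop condition: fixed number of steps
    loss = (w - 3) ** 2
    loss.backward()                        # 1. compute the gradient at the current point
    with torch.no_grad():
        w -= learning_rate * w.grad        # 2. step in the opposite direction of the gradient
        w.grad.zero_()                     # reset the gradient before the next iteration

print(w.item())  # very close to 3.0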
4.2.3 Weights initialization
The starting point is defined by the first output of the model, and therefore by the initial values of its weights. There are numerous methods to initialize the weights, but the most common one is to draw them randomly from a centered, normalized Gaussian distribution.
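In PyTorch, for instance, this can be done explicitly (the standard deviation below is an arbitrary choice for the example):

import torch.nn as nn

layer = nn.Linear(20, 10)
# Re-initialize the weights from a centered Gaussian distribution
nn.init.normal_(layer.weight, mean=0.0, std=0.1)
nn.init.zeros_(layer.bias)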
4.2.4 Learning rate
The gradient gives us a direction and a norm, but this norm is arbitrary and has to be rescaled using what we call the learning rate. The learning rate doesn't define the size of the steps by itself, but the scaling factor applied to the gradient's norm, which means that the norm still plays a crucial role.
The choice of the learning rate is crucial to hopefully converge quickly to the global minimum loss.
4.2.5 Stop condition
The stop condition determines when to stop the algorithm. An easy solution is to fix a number of steps before launching the algorithm, but this either implies useless computations after the algorithm has reached its final point, or stopping too early and not getting the best results possible.
Therefore, although there are more complex methods, the most common and simple process is to monitor the value of the loss, memorize the lowest value ever reached, and stop when there has been a given number of steps without any improvement to the best value. Then, we usually keep the model weights corresponding to this best value.
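A minimal sketch of this patience mechanism (model and the validate function are hypothetical placeholders, assumed to be defined elsewhere):

import copy

patience = 10             # number of epochs without improvement before stopping
best_loss = float("inf")
best_weights = None
epochs_without_improvement = 0

for epoch in range(1000):
    val_loss = validate(model)  # hypothetical: returns the validation loss
    if val_loss < best_loss:
        best_loss = val_loss
        best_weights = copy.deepcopy(model.state_dict())  # memorize the best weights
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop: no improvement for `patience` epochs

model.load_state_dict(best_weights)  # keep the weights of the best epoch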
4.2.6 Unlucky examples
4.3 Backpropagation
4.3.1 Why?
Gradient Descent is beautiful, but so far we only know in which direction (the gradient) the output of the model should move. To transmit this information to the weights of the layers of the model, we use backpropagation.
4.3.2 Definition
Backpropagation
The process of computing the gradient with respect to the weights of each layer of the model and modifying them accordingly. The name comes from the process starting at the last layer of the model and propagating incrementally back to the first layer.
4.3.3 How?
Backpropagation involves a lot of computations of partial derivatives, which are individually not difficult but very tedious. Fortunately, NN libraries handle backpropagation automatically with a single function call, so there is no need to worry about it.
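For instance, in PyTorch, one call to backward() computes the gradients of all the parameters:

import torch
import torch.nn as nn

layer = nn.Linear(2, 1)
prediction = layer(torch.tensor([[1.0, 2.0]]))
loss = ((prediction - 5.0) ** 2).mean()
loss.backward()           # one call computes all the partial derivatives
print(layer.weight.grad)  # d(loss)/d(weights) of the layer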
4.4 Examples
4.4.1 PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Generate some simple data
X = torch.randn(100, 2)  # 100 samples, 2 features
y = (2 * X[:, 0]**3 + 0.5 * X[:, 1]**2 + torch.randn(100) * 0.5).unsqueeze(1)

# Create a DataLoader for batching
batch_size = 32
dataloader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

# Instantiate the model
model = SimpleMLP()  # A simple model that is assumed to be defined before

# Define a loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic Gradient Descent

# Training loop
epochs = 1000
epoch_losses = []  # List to store average loss at each epoch
for epoch in range(epochs):
    epoch_loss = 0.0  # Accumulate loss for this epoch
    for batch_X, batch_y in dataloader:
        # Forward pass
        predictions = model(batch_X)  # Compute the model's predictions
        batch_loss = criterion(predictions, batch_y)  # Compare predictions to true values
        # Backward pass and optimization
        optimizer.zero_grad()  # Reset gradients
        batch_loss.backward()  # Backward pass
        optimizer.step()  # Update parameters
        # Accumulate weighted loss for the batch
        epoch_loss += batch_loss.item() * batch_X.size(0)
    epoch_loss /= len(dataloader.dataset)  # Normalize by dataset size to get average epoch loss
    epoch_losses.append(epoch_loss)  # Store the epoch's average loss
    # Print loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1:>4}/{epochs}, Loss: {epoch_loss:.4f}")
4.4.2 TensorFlow + Keras
import tensorflow as tf
from tensorflow.keras import layers, Model
import numpy as np

# Generate some simple data
X = np.random.randn(100, 2).astype(np.float32)  # 100 samples, 2 features
y = (2 * X[:, 0]**3 + 0.5 * X[:, 1]**2 + np.random.randn(100) * 0.5).astype(np.float32)
y = y.reshape(-1, 1)  # Reshape y to (100, 1)

# Create a Dataset for batching
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size)

# Instantiate the model
model = SimpleMLP()  # A simple model that is assumed to be defined before

# Define a loss function and optimizer
loss_fn = tf.keras.losses.MeanSquaredError()  # Mean Squared Error Loss
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # Stochastic Gradient Descent

# Training loop
epochs = 1000
epoch_losses = []  # List to store average loss at each epoch
for epoch in range(epochs):
    epoch_loss = 0.0  # Accumulate loss for this epoch
    for batch_X, batch_y in dataset:
        # Forward pass
        with tf.GradientTape() as tape:
            predictions = model(batch_X)  # Compute the model's predictions
            batch_loss = loss_fn(batch_y, predictions)  # Compare predictions to true values
        # Backward pass and optimization
        gradients = tape.gradient(batch_loss, model.trainable_variables)  # Backward pass
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))  # Update parameters
        # Accumulate weighted loss for the batch
        epoch_loss += batch_loss.numpy() * len(batch_X)
    epoch_loss /= len(X)  # Normalize by dataset size to get average epoch loss
    epoch_losses.append(epoch_loss)  # Store the epoch's average loss
    # Print loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1:>4}/{epochs}, Loss: {epoch_loss:.4f}")
5 Transfer Learning
5.1 Introduction
5.1.1 Principle
Start Closer to the Goal
The basic idea of Transfer Learning is to use a model that was pre-trained on a similar task. This initial model should have learnt basic knowledge that is common to its initial task and our new task, making it capable of learning the new task faster.
It is easier to learn German knowing French compared to not knowing any language. But it’s easier to learn German knowing Dutch compared to knowing French. And it’s even easier if you know Dutch, English, French and 15 other languages.
5.1.2 Reasons
- Limited amount of data for the new task
- New task is similar to old task
- Training a model is very costly
- Better final performance with less overfitting
5.1.3 Applications
Two major applications:
- Computer Vision
- Natural Language Processing
Because CV and NLP are two domains with:
- Very complex tasks and therefore deep models
- A lot of attention and therefore lots of good pre-trained models
5.2 Categories
5.2.1 Fine-tuning
Idea
Take a pre-trained model as is, freeze some of its layers (usually the first ones) and continue the training where it was stopped, using our new dataset (a code sketch follows the challenges below).
Challenges
- Find a good pre-trained model
- Have a proper dataset (even if fine-tuning works with smaller datasets)
- Freeze the right number of layers (more if the tasks are similar)
- Train the model properly (useful to know how it was pre-trained)
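As an illustration, here is a minimal fine-tuning sketch using PyTorch and torchvision (the choice of ResNet-18, the frozen layers and the 10 output classes are arbitrary assumptions for the example):

import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the early layers, which contain the most generic features
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

# Replace the final layer to match the new task (here: 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)
# Training then only updates the last block and the new head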
5.2.2 Feature Extraction
Idea
Use the pre-trained model, with all its layers frozen, as a fixed feature extractor, and train only a new small model (often a single fully connected layer) on top of the extracted features.
5.2.3 Multitask Learning
Idea
Train a single model on several related tasks at the same time, so that the shared layers learn representations that are common to all the tasks.
5.2.4 Knowledge Distillation
Idea
Train a small "student" model to reproduce the outputs of a large pre-trained "teacher" model, transferring its knowledge into a cheaper model.
6 Python Libraries
6.0.1 PyTorch
- Website: https://pytorch.org/
- Flexibility, ease of debugging, beginner-friendly
- Favored in academia and by researchers
6.0.2 TensorFlow
- Website: https://www.tensorflow.org/
- Scalable, production-ready and easy deployment
- Favored in production and industry
6.0.3 Keras
- Website: https://keras.io/
- User-friendly, high-level and powerful with TensorFlow integration
- Favored for first hands-on experience for beginners and for prototyping